To make the file names consistent and prevent incorrect sorting, leading zeros were added to all file names that lacked them.
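The renaming step can be sketched in shell as follows. This is a minimal sketch, not the procedure that was actually used: it assumes file names of the form `<prefix>_<number><rest>` (for example `P1_4-F.ab1`, an assumed naming scheme), and the function names `pad_name` and `pad_files` are hypothetical.

```bash
# Pad a single-digit sample number that follows the first underscore
# to two digits, e.g. P1_4-F.ab1 -> P1_04-F.ab1 (assumed naming scheme).
# Names that are already padded, or have longer numbers, are unchanged.
pad_name() {
    printf '%s\n' "$1" | sed -E 's/^([^_]+)_([0-9])([^0-9]|$)/\1_0\2\3/'
}

# Rename every file in directory $1 whose name needs padding.
pad_files() {
    for file in "$1"/*_*; do
        [ -e "$file" ] || continue
        new="$1/$(pad_name "$(basename "$file")")"
        # mv -n refuses to overwrite an existing file
        [ "$file" = "$new" ] || mv -n -- "$file" "$new"
    done
}
```

Calling `pad_files` with the directory that holds the read files renames them in place; running it a second time changes nothing.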
The dataset consisted of 154 files, each containing either a forward or a reverse read. These were individually imported into Gear Genomics' Pearl tool. A consensus sequence was created from the reads of each sample, resulting in a total of 76 DNA sequences. 44 of the consensus sequences had to be edited manually; for the rest, the automatically generated consensus sequences could be used without modification.
Manual editing included the following steps:
For the sequences that required modification, two files were exported: the original automatic consensus sequence and the manually corrected one (marked with -a and -m in the file name, respectively).
A small bash script (edit_consensus.sh) was written to rename the FASTA headers of the exported sequences. This was necessary because all files exported by Pearl use the uninformative header '>user sequence'. The script replaces that header with the sequence name, taken from the file name, and then combines all FASTA files into a single multifasta file.
# set the working directory from script input
work_dir=$1
cd "$work_dir" || exit 1
# loop over every .fa file in the WD and replace 'user sequence' with the sequence name
for file in *.fa; do
    # get the sequence name from the file name
    seq_name=$(basename "$file" .fa)
    echo "$seq_name"
    # replace sequence headers; -i.bak works with both GNU and BSD sed
    sed -i.bak -e "s/user sequence/${seq_name}/g" "$file"
done
# remove the .fa.bak backup files generated by sed
rm -f -- *.fa.bak
# take the first .fa file (in glob order) to derive the name of the multifasta
set -- *.fa
first_name=$(basename "$1" .fa)
# the second '-'-separated field of that file name becomes the suffix
multifasta="seq_consensus-$(echo "$first_name" | cut -d "-" -f 2).fasta"
# start with an empty multifasta
: > "$multifasta"
# concat all fasta files in the WD into a multifasta, separated by blank lines
for file in *.fa; do
    cat "$file" >> "$multifasta"
    echo >> "$multifasta"
done
The combined multifasta was imported into Seaview. A reverse complement was created for each sequence; these were exported into a new file, seq_complement.fst, which was then reopened in Seaview. The genetic code for all sequences was initially set to 'Invertebrate Mitochondrial'.
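In Seaview the reverse complement is created through the graphical interface. For reference, the same operation can be sketched on the command line. This is only an illustration under stated assumptions: the function name `revcomp_fasta` is hypothetical, and the input is assumed to be a plain nucleotide FASTA file that may use IUPAC ambiguity codes.

```bash
# Print the reverse complement of every record in the FASTA file $1.
# Sequences wrapped over several lines are joined first; blank lines
# between records are ignored. S, W and N are their own complements
# and therefore pass through unchanged.
revcomp_fasta() {
    awk '
        function emit(   i, c, out) {
            if (name == "") return
            out = ""
            for (i = length(seq); i >= 1; i--) {
                c = substr(seq, i, 1)
                out = out ((c in comp) ? comp[c] : c)
            }
            print name
            print out
        }
        BEGIN {
            n = split("A C G T R Y K M B V D H", from, " ")
            split("T G C A Y R M K V B H D", to, " ")
            for (k = 1; k <= n; k++) {
                comp[from[k]] = to[k]
                comp[tolower(from[k])] = tolower(to[k])
            }
        }
        /^>/ { emit(); name = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$1"
}
```

The output is written to standard output, so it can be redirected into a file such as seq_complement.fst.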
By repeatedly checking the translated protein sequences and adjusting the reading frame where necessary, the sequences were adjusted so that they no longer contained stop codons. Some sequences contained stop codons regardless of which reading frame was chosen:
This issue could be resolved by going back to Pearl and re-evaluating the conflicts. A quick BLAST search (see section two) revealed that some of the sequences were related to non-invertebrates; the genetic code for those sequences was set to 'Vertebrate Mitochondrial'. After this, only the following sequences still showed stop codons:
All sequences with their adjusted reading frames were saved into a multifasta file.
After all sequences had been adjusted in Seaview, the new multifasta was uploaded to NCBI's Nucleotide BLAST website. All settings were left at their defaults. The search indicates which species each sequence most likely originates from.
For every sequence, the top result was taken from the Taxonomy / Organism overview. If several results had nearly identical top scores, all of them were noted. The results of the BLAST search are listed in the table below.
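The selection rule described above was applied by hand in the BLAST web interface. As an illustration only, it could be sketched in shell for a hit table saved as tab-separated text. The three-column layout (sequence name, taxon, score), the function name `top_hits` and the tolerance parameter are assumptions made for this sketch, not part of the original workflow.

```bash
# Print, for every query in the tab-separated hit table $1
# (assumed columns: sequence name, taxon, score), all hits whose score
# lies within $2 (default 0) of the best score for that query.
# The file is read twice: first to find the best score per query,
# then to print the hits that qualify.
top_hits() {
    awk -F '\t' -v tol="${2:-0}" '
        NR == FNR {
            if (!($1 in best) || $3 + 0 > best[$1]) best[$1] = $3 + 0
            next
        }
        $3 + 0 >= best[$1] - tol { print $1 "\t" $2 "\t" $3 }
    ' "$1" "$1"
}
```

A sequence with two equally scored taxa then appears on two lines, which corresponds to the entries marked as ambiguous in the results.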
In total, 22 different taxa were found.
Table 1 - Results of the BLAST search for all samples. The search input was the complement sequences of the input dataset.
| Sequence | Taxon | Higher taxon | Max. percent identity | Query cover | Notes |
|---|---|---|---|---|---|
| P1_01 | Acartia bifilosa | Crustacea | 99,31% | 84,00% | |
| P1_04 | Acartia tonsa | Crustacea | 100,00% | 79,00% | |
| P1_07 | Acartia bifilosa | Crustacea | 99,09% | 77,00% | |
| P1_09 | Pseudodiaptomus marinus | Crustacea | 99,84% | 77,00% | |
| P1_10 | Acartia bifilosa | Crustacea | 99,84% | 80,00% | |
| P1_11 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P1_12 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P1_13 | Mesopodopsis slabberi | Crustacea | 99,34% | 53,00% | |
| P1_18 | Monocorophium acherusicum | Crustacea | 100,00% | 77,00% | |
| P1_19 | Acartia bifilosa | Crustacea | 99,09% | 92,00% | |
| P1_20 | Pseudodiaptomus marinus | Crustacea | 99,85% | 79,00% | |
| P1_22 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P1_24 | Acartia bifilosa | Crustacea | 99,24% | 84,00% | |
| P1_33 | Acartia bifilosa | Crustacea | 99,24% | 79,00% | |
| P1_34 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P1_35 | Acartia bifilosa | Crustacea | 99,24% | 74,00% | |
| P1_36 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P1_37 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P1_38 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P1_39 | Acartia bifilosa | Crustacea | 100,00% | 80,00% | |
| P1_40 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P1_41 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P1_43 | Temora longicornis | Crustacea | 100,00% | 76,00% | |
| P1_44 | Pseudodiaptomus marinus | Crustacea | 100,00% | 69,00% | |
| P1_45 | Acartia tonsa | Crustacea | 100,00% | 76,00% | |
| P1_46 | Ensis directus | Mollusca | 100,00% | 77,00% | |
| P1_47 | Acartia bifilosa | Crustacea | 100,00% | 77,00% | |
| P1_48 | Acartia bifilosa | Crustacea | 98,68% | 92,00% | |
| P2_01 | Pseudodiaptomus marinus | Crustacea | 100,00% | 68,00% | |
| P2_02 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P2_04 | Acartia tonsa | Crustacea | 100,00% | 77,00% | |
| P2_05 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P2_07 | Acartia bifilosa | Crustacea | 99,09% | 75,00% | |
| P2_09 | Acartia bifilosa | Crustacea | 100,00% | 80,00% | |
| P2_10 | Acartia bifilosa | Crustacea | 99,24% | 87,00% | |
| P2_12 | Acartia bifilosa | Crustacea | 99,24% | 78,00% | |
| P2_17 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P2_18 | Acartia tonsa | Crustacea | 100,00% | 78,00% | |
| P2_20 | Centropages typicus | Crustacea | 99,64% | 66,00% | |
| P2_29 | Pseudodiaptomus marinus | Crustacea | 100,00% | 74,00% | |
| P2_30 | Acartia bifilosa | Crustacea | 99,24% | 97,00% | |
| P2_31 | Pseudodiaptomus marinus | Crustacea | 100,00% | 77,00% | |
| P2_32 | Pseudodiaptomus marinus | Crustacea | 99,84% | 74,00% | |
| P2_34 | Pseudodiaptomus marinus | Crustacea | 99,85% | 78,00% | |
| P2_35* | Centropages typicus | Crustacea | 99,46% | 65,00% | |
| P2_39 | Ensis directus | Mollusca | 100,00% | 53,00% | |
| P2_40 | Acartia bifilosa | Crustacea | 99,09% | 78,00% | |
| P2_41 | Acartia tonsa | Crustacea | 100,00% | 76,00% | |
| PM2_02 | Parasagitta setosa | Chaetognatha | 99,39% | 76,00% | |
| PM2_06 | Liocarcinus holsatus | Crustacea | 100,00% | 79,00% | ambiguous |
| PM2_06 | Polybius henslowii | Crustacea | 100,00% | 79,00% | ambiguous |
| PM2_07 | Liocarcinus holsatus | Crustacea | 99,85% | 76,00% | ambiguous |
| PM2_07 | Polybius henslowii | Crustacea | 99,85% | 76,00% | ambiguous |
| PM2_08 | Upogebia deltaura | Crustacea | 99,54% | 77,00% | |
| PM2_09 | Upogebia deltaura | Crustacea | 100,00% | 70,00% | |
| PM2_10 | Upogebia deltaura | Crustacea | 99,85% | 82,00% | |
| PM2_11 | Ditrichocorycaeus anglicus | Crustacea | 99,85% | 74,00% | |
| PM2_13 | Ditrichocorycaeus anglicus | Crustacea | 99,38% | 76,00% | |
| PM2_16 | Eutrigla gurnardus | Actinopterygii | 99,69% | 74,00% | |
| PM2_17 | Scomber scombrus | Actinopterygii | 99,54% | 77,00% | |
| PM3_01 | Pullosquilla litoralis | Crustacea | 85,31% | 98,00% | |
| PM3_02 | Pullosquilla litoralis | Crustacea | 85,47% | 99,00% | |
| PM3_03 | Liocarcinus holsatus | Crustacea | 100,00% | 76,00% | ambiguous |
| PM3_03* | Polybius henslowii | Crustacea | 100,00% | 75,00% | ambiguous |
| PM3_04 | Liocarcinus holsatus | Crustacea | 99,70% | 77,00% | |
| PM3_06 | Isias clavipes | Crustacea | 100,00% | 77,00% | |
| PM3_07 | Isias clavipes | Crustacea | 99,85% | 78,00% | |
| PM3_08 | Isias clavipes | Crustacea | 99,70% | 77,00% | |
| PM3_09 | Acartia clausii | Crustacea | 99,83% | 80,00% | |
| PM3_20 | Ditrichocorycaeus anglicus | Crustacea | 100,00% | 77,00% | |
| PM3_23 | Arnoglossus laterna | Actinopterygii | 99,69% | 79,00% | |
| PM4_04 | Upogebia deltaura | Crustacea | 99,24% | 78,00% | |
| PM4_06* | Centropages typicus | Crustacea | 98,75% | 67,00% | |
| PM4_09 | Temora longicornis | Crustacea | 100,00% | 85,00% | |
| PM4_10 | Paracalanus parvus | Crustacea | 100,00% | 76,00% | |
| PM10_01 | Ditrichocorycaeus anglicus | Crustacea | 100,00% | 77,00% | |
| PM10_02 | Ditrichocorycaeus anglicus | Crustacea | 99,70% | 77,00% | |
| PM10_03 | Acartia clausii | Crustacea | 100,00% | 86,00% | |
| SR_303 | Scutigerellidae | Arthropoda | 99,36% | 100,00% | suspicious |
| SR_303 | Homo sapiens | Mammalia | 99,35% | 99,00% | suspicious |
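
The number of distinct taxa in Table 1 can be cross-checked with a short shell sketch. It assumes that the table has been saved on its own in a file, with the header in the first row, the separator in the second row and the taxon in the second column; the function name `count_taxa` is hypothetical.

```bash
# Count the distinct entries in the second column of a pipe-delimited
# table stored in file $1. The first two rows (header and separator)
# are skipped. Because every row starts with '|', awk sees an empty
# first field, so the taxon is field 3.
count_taxa() {
    awk -F '|' '
        NR > 2 && NF >= 3 {
            taxon = $3
            gsub(/^[ \t]+|[ \t]+$/, "", taxon)
            if (taxon != "" && !(taxon in seen)) { seen[taxon] = 1; n++ }
        }
        END { print n + 0 }
    ' "$1"
}
```

Sequences that are listed twice because of an ambiguous result contribute each of their candidate taxa to the count.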
Most of the sequences in the dataset were identified as crustaceans. Some were not, notably the following:
Three sequences were ambiguous, with two species sharing the highest score. For all three, the candidates were Liocarcinus holsatus and Polybius henslowii:
PM2_06
PM2_07
PM3_03
SR_303 was a strong outlier, matching either a member of the family Scutigerellidae (Arthropoda) or the species Homo sapiens (Mammalia).
Before starting the phylogenetic analysis, the sequences in the multifasta were sorted. The three fish, the molluscs and the chaetognath were moved down to improve readability.
After the initial alignment with MUSCLE, a pattern became clear: the three sequences P2_35, PM3_03 and PM4_06 did not fit the rest. For unknown reasons, they were much longer than the other sequences and created large gaps in the alignment.
Figure 1 - Detail of the full alignment.
Removing these sequences substantially improved the alignment.